-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: fixed .str.contains(..., na=False) for categorical series #22170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Failure is at https://travis-ci.org/pandas-dev/pandas/jobs/411345500#L2287-L2289 Basically, you're fix only works if there are missing values. This may fix it, but may break other things diff --git a/pandas/core/strings.py b/pandas/core/strings.py
index 8097ed196..06fd1391e 100644
--- a/pandas/core/strings.py
+++ b/pandas/core/strings.py
@@ -1855,7 +1855,12 @@ class StringMethods(NoNewAttributesMixin):
# before the transformation...
if use_codes and self._is_categorical:
result = take_1d(result, self._orig.cat.codes)
- result[isna(result)] = na
+ missing = isna(result)
+
+ if missing.any():
+ result_type = np.result_type(result, na)
+ result = result.astype(result_type, copy=False)
+ result[isna(result)] = na
if not hasattr(result, 'ndim') or not hasattr(result, 'dtype'):
return result |
Hello @pulkitmaloo! Thanks for updating the PR.
Comment last updated on September 24, 2018 at 01:15 Hours UTC |
Looks like a maybe unrelated CI failure. Can you merge master and push again? |
And can you ensure (and test) that |
Codecov Report
@@ Coverage Diff @@
## master #22170 +/- ##
=========================================
Coverage ? 92.19%
=========================================
Files ? 169
Lines ? 50950
Branches ? 0
=========================================
Hits ? 46975
Misses ? 3975
Partials ? 0
Continue to review full report at Codecov.
|
… into categorical_bug
ci/circle-27-compat.yaml
Outdated
@@ -7,7 +7,7 @@ dependencies: | |||
- cython=0.28.2 | |||
- jinja2=2.8 | |||
- numexpr=2.4.4 # we test that we correctly don't use an unsupported numexpr | |||
- numpy=1.9.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you revert this (and the rest of the .yaml)
… into categorical_bug
@jreback Thanks for the feedback, I have updated the PR as you requested. However, the changes in yaml files leads to circleci failure. |
@pulkitmaloo you need to rebase on master |
… into categorical_bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, can you add a whatsnew note in bug fixes
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -704,7 +704,7 @@ Numeric | |||
Strings | |||
^^^^^^^ | |||
|
|||
- | |||
- Bug in :class:`StringMethods` where `str_contains()` was not filling missing values with given argument for na for a categorical Series (:issue:`22158`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use double backticks on Series, can you use the actual reference here, e.g. :func:`Series.str.contains`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Thanks, I've updated it. Please let me know if there's anything else you'd like to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Can you please review and merge this PR.
Co-Authored-By: pulkitmaloo <[email protected]>
pandas/tests/test_strings.py
Outdated
values = Series(["a", "b", "c", "a", np.nan], dtype="category") | ||
result = values.str.contains('a', na=True).astype(object) | ||
expected = Series([True, False, False, True, True], dtype=np.object_) | ||
assert isinstance(result, Series) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be removed. The assert_series_equal checks that.
pandas/tests/test_strings.py
Outdated
assert res.loc[2] == "foo" | ||
# na for category | ||
values = Series(["a", "b", "c", "a", np.nan], dtype="category") | ||
result = values.str.contains('a', na=True).astype(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the .astype(object)
. What's the dtype when you don't do that? Categorical? If so, I'd rather we document that as part of the API and test it.
pandas/tests/test_strings.py
Outdated
|
||
# na for objects | ||
values = Series(["a", "b", "c", "a", np.nan]) | ||
result = values.str.contains('a', na=True).astype(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for the astype object.
@pulkitmaloo are you able to update? |
@TomAugspurger should be fixed up. |
thanks @pulkitmaloo |
…fixed * upstream/master: DOC: more consistent flake8-commands in contributing.rst (pandas-dev#23724) DOC: Fixed the doctsring for _set_axis_name (GH 22895) (pandas-dev#22969) DOC: Improve GL03 message re: blank lines at end of docstrings. (pandas-dev#23649) TST: add tests for keeping dtype in Series.update (pandas-dev#23604) TST: For GH4861, Period and datetime in multiindex (pandas-dev#23776) TST: move .str-test to strings.py & parametrize it; precursor to pandas-dev#23582 (pandas-dev#23777) STY: isort tests/scalar, tests/tslibs, import libwindow instead of _window (pandas-dev#23787) BUG: fixed .str.contains(..., na=False) for categorical series (pandas-dev#22170) BUG: Maintain column order with groupby.nth (pandas-dev#22811) API/DEPR: replace kwarg "pat" with "sep" in str.[r]partition (pandas-dev#23767) CLN: Finish isort core (pandas-dev#23765) TST: Mark test_pct_max_many_rows with pytest.mark.single (pandas-dev#23799)
git diff upstream/master -u -- "*.py" | flake8 --diff